Red Wine Quality by Ahmed Ashraf

This report explores a dataset of 1,599 red wine samples, each described by 11 chemical properties and a quality score. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent).

Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## [1] 1599   13
str(red_wine)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Our dataset consists of 13 variables (X is a row index running from 1 to 1599) and 1,599 observations.
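The inspection output above comes from a handful of base R calls. As a minimal sketch, the block below rebuilds only the first ten rows of four columns, copied from the `str()` output, so it runs without the data file. The name `wine_head` is a hypothetical stand-in for `red_wine`, and the printed numbers will not match the full-data summaries.

```r
# Illustrative only: first ten rows of four columns, as printed by str() above.
# The full data frame (red_wine) has 1599 rows and 13 columns.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine_head)    # column names
dim(wine_head)      # number of rows, number of columns
str(wine_head)      # type and leading values of each column
summary(wine_head)  # min, quartiles, mean and max per column
```

Running the same four calls on `red_wine` should give output in the shape shown above.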

Most samples have a quality of 5 or 6. No sample has a quality value below 3 or above 8.
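One way to check which quality scores actually occur is to tabulate them against the full 0 to 10 scale, so unused scores show up as zero counts. This is a sketch on the ten `quality` values visible in the `str()` output, not on the full column, so the counts are only illustrative.

```r
# Illustrative only: the ten quality values shown in the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Using a factor with levels 0:10 keeps scores that never occur in the table.
quality_counts <- table(factor(quality, levels = 0:10))
print(quality_counts)

# Lowest and highest score actually observed.
observed_range <- range(quality)
print(observed_range)
```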

The graph above shows histograms for all variables in the dataset, so we can see how each variable is distributed. For better visualization, let's view each graph individually, which lets us customize the plot for each variable.

fixed acidity

fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

It’s a right-skewed distribution with a peak at 7, a mean of 8.32, and a maximum value of 15.90.

volatile acidity

volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

It’s a bimodal distribution with two peaks, at about 0.4 and 0.6. I expect high levels of volatile acidity to lead to worse wine quality.

citric acid (g / dm^3)

citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

It’s right skewed, with a mean of 0.271 and a maximum of 1. I think high-quality wines should contain a certain amount of citric acid.

residual sugar (g / dm^3)

residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The distribution is roughly bell-shaped around 2 with a long right tail. The second graph shows the log transformation of residual sugar. The mean is 2.54 and the maximum goes all the way up to 15.50. No values are close to 45, but a few are less than 1.
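To illustrate why a log transformation helps with a long right tail, the sketch below applies `log10()` to the ten residual sugar values shown in the `str()` output. The helper `tail_ratio` is a hypothetical name introduced here; it measures how far the maximum sits above the median in units of the interquartile range.

```r
# Illustrative only: the ten residual.sugar values shown in the str() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Log transformation compresses large values more than small ones.
log_sugar <- log10(residual_sugar)

# Hypothetical helper: distance from median to maximum, in IQR units.
tail_ratio <- function(x) (max(x) - median(x)) / IQR(x)

print(tail_ratio(residual_sugar))  # raw scale
print(tail_ratio(log_sugar))       # log scale, expected to be smaller
```

On these ten values the ratio drops after the transformation, which is the same effect the second histogram is meant to show on the full data.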

chlorides (sodium chloride - g / dm^3)

chlorides: the amount of salt in the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

It’s a long-tailed histogram with a mean of 0.087 and a median of 0.079.

free sulfur dioxide (mg / dm^3)

free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
##   [1] 11 25 15 17 11 13 15 15  9 17 15 17 16  9 52 51 35 16  6 17 29 23 10
##  [24]  9 21 11  4 10 14  8 17 22 15 40 13  5  3 13  7 12 12 17  8  9  5  8
##  [47] 22 12  5 12  4  8  6 30 33 25  4 50 17  9 19 20 12 13  4  4 11  6 27
##  [70]  8 15 17 18 11 28  9  9 14 12 27  3 22 21 16 18 19 20  9 34  8 42 20
##  [93] 19  9 41 17  8  3  5 13

It’s a right-skewed distribution with a mean of 15.87 and a median of 14.00. Most values are integers.
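The claim that most values are integers can be checked by comparing each value with its rounded version. This sketch uses the first 23 values printed above; all of them happen to be whole numbers, so it cannot show the non-integer cases that the word "most" implies exist in the full column.

```r
# Illustrative only: the first 23 free.sulfur.dioxide values printed above.
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17,
              16, 9, 52, 51, 35, 16, 6, 17, 29, 23, 10)

# A value counts as whole if it equals its rounded self (within a tolerance).
is_whole <- abs(free_so2 - round(free_so2)) < 1e-8

# Share of whole-number values; on the full column this should be below 1.
share_whole <- mean(is_whole)
print(share_whole)
```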

total sulfur dioxide (mg / dm^3)

total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

It’s a right-skewed distribution. Most values are integers, measured in mg / dm^3.

density (g / cm^3)

density: the density of wine is close to that of water depending on the percent alcohol and sugar content.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density is approximately normally distributed, with a mean of 0.9967 and a median of 0.9968.

pH

pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

It’s approximately normally distributed, with a mean of 3.311 and a median of 3.310. Most values lie between 3 and 3.7.

sulphates (potassium sulphate - g / dm^3)

sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

It’s a right-skewed distribution with a long tail. Its mean is 0.6581 and its median is 0.62.

alcohol (% by volume)

alcohol: the percent alcohol content of the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

It’s a right-skewed distribution with a mean of 10.42 and a median of 10.20. Wine is an alcoholic drink, so I wonder how alcohol is related to wine quality.
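The shape descriptions in this section can be sanity-checked from the reported summary statistics alone. The sketch below computes the quartile (Bowley) skewness, (Q3 + Q1 - 2 * median) / (Q3 - Q1), and the mean minus the median for five variables, using only the numbers printed above. The data frame `reported` and its column names are introduced here for illustration. This is a rough check, not a formal test: quartile skewness only looks at the middle half of the data, so it understates long tails such as the one in residual sugar.

```r
# Summary statistics copied from the summary() output in this report.
reported <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "density", "pH", "alcohol"),
  q1       = c(7.10, 1.900, 0.9956, 3.210, 9.50),
  median   = c(7.90, 2.200, 0.9968, 3.310, 10.20),
  mean     = c(8.32, 2.539, 0.9967, 3.311, 10.42),
  q3       = c(9.20, 2.600, 0.9978, 3.400, 11.10)
)

# Quartile (Bowley) skewness: 0 for symmetric quartiles, > 0 for right skew.
reported$quartile_skew <- with(reported, (q3 + q1 - 2 * median) / (q3 - q1))

# A mean well above the median is another hint of right skew.
reported$mean_minus_median <- with(reported, mean - median)

print(reported)
```

Density and pH come out close to zero on both measures, which agrees with describing them as approximately normal, while fixed acidity, residual sugar and alcohol come out positive.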

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wine samples in the dataset with 13 variables. All of them are floating-point numbers except quality and X, which are integers.

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are volatile acidity, alcohol and quality. I’d like to determine which features are best for predicting the quality of a wine sample. I suspect volatile acidity, alcohol and some combination of the other variables can be used to build a predictive model of quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

pH and citric acid likely contribute to the quality of wine.

Did you create any new variables from existing variables in the dataset?

No, I think there is no need to create any new variables.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

No, there was no need to perform any operations on this data set because it is already tidy; it appears to have been wrangled and cleaned. There are some unusual outliers, but they seem to be real values.

Bivariate Plots Section

In the above graph, we see that quality is negatively correlated with volatile acidity (about -0.4), while it is positively correlated with alcohol (about 0.5) and sulphates (about 0.3).

In general, quality tends to increase as volatile acidity decreases, reflecting the negative correlation between them. That agrees with our expectations, because high levels of volatile acidity can lead to an unpleasant taste.

In general, wines with more alcohol tend to have higher quality values, except at a quality value of 5.
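A correlation overview like the one described here can be computed with `cor()` on the numeric columns. The sketch below uses the first ten rows of four columns from the `str()` output; `wine_head` is a hypothetical stand-in for `red_wine`, and with only ten rows the coefficients will differ from the values quoted in this report.

```r
# Illustrative only: first ten rows of four columns, as printed by str() above.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations, rounded for readability.
cor_matrix <- round(cor(wine_head, method = "pearson"), 2)
print(cor_matrix)

# The row (or column) for quality is the part of interest here.
print(cor_matrix["quality", ])
```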
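The per-quality comparison behind a boxplot can also be read off numerically by taking the median of alcohol within each quality level. This is a sketch on the ten rows shown in the `str()` output, so it only covers quality values 5, 6 and 7 and is not expected to reproduce the full-data pattern.

```r
# Illustrative only: the ten alcohol and quality values from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Median alcohol within each quality level (the line inside each box).
median_alcohol <- tapply(alcohol, quality, median)
print(median_alcohol)

# Number of samples behind each median, to judge how reliable it is.
print(table(quality))
```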

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$sulphates and red_wine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

Higher quality wines tend to have more sulphates.
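The t statistic and confidence interval in the `cor.test` output follow directly from the correlation and the sample size: t = r * sqrt(n - 2) / sqrt(1 - r^2), and the interval is built on the Fisher z scale, atanh(r) plus or minus a normal quantile divided by sqrt(n - 3). As a sketch, the hypothetical helper `cor_inference` below recomputes both from the reported r = 0.2513971 and n = 1599.

```r
# Hypothetical helper: recompute Pearson test quantities from r and n.
cor_inference <- function(r, n, conf_level = 0.95) {
  df <- n - 2
  # t statistic for H0: true correlation equals 0
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)
  # Confidence interval via the Fisher z transformation
  z <- atanh(r)
  half_width <- qnorm(1 - (1 - conf_level) / 2) / sqrt(n - 3)
  conf_int <- tanh(c(z - half_width, z + half_width))
  list(t = t_stat, df = df, conf_int = conf_int)
}

# Values reported above for sulphates and quality.
result <- cor_inference(r = 0.2513971, n = 1599)
print(result)
```

The result should agree with the printed t = 10.38, df = 1597 and the interval from about 0.205 to 0.297. The very small p-value reflects the large sample; the correlation itself is still weak.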

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$citric.acid and red_wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

The above graph shows that the median citric acid increases as quality increases. The correlation between citric acid and quality is 0.226; although this is a weak correlation, it suggests that citric acid is associated with wine quality.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$citric.acid and red_wine$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

Citric acid is moderately and negatively correlated with volatile acidity, with a value of -0.5525.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$density and red_wine$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

Alcohol is negatively correlated with density, with a value of -0.50.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$chlorides and red_wine$sulphates
## t = 15.978, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3282127 0.4127694
## sample estimates:
##       cor 
## 0.3712605

There is no strong relationship between sulphates and chlorides, although they are correlated with a value of 0.37. We can also see that the number of outliers increases as sulphates increase.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$fixed.acidity and red_wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

Fixed acidity and density are strongly correlated with a value of 0.67.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • Volatile acidity has a negative correlation with quality of about -0.4. The volatile acidity boxplots showed that the median volatile acidity is lower at each higher quality level.
  • Alcohol and quality have a positive correlation of about 0.5.
  • Sulphates and citric acid are positively correlated with quality: wines with higher values of either tend to have higher quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes, there is a strong relationship between free sulfur dioxide and total sulfur dioxide, which are not covered in my analysis. A high correlation between density and fixed acidity was also observed.

What was the strongest relationship you found?

The strongest relationship is between pH and fixed acidity, with a correlation value of -0.68.

Multivariate Plots Section

From the above graphs, we can see that most wines with quality values greater than 6 have citric acid values greater than 0.25 and alcohol values greater than 11%. I also split the data using facet_wrap to check whether points were overplotting.

Most of the lowest-quality wines have higher volatile acidity. Most wines with quality values lower than 5 have volatile acidity greater than 0.5, while their citric acid values vary along the x axis.

Most wines with high quality values have alcohol values greater than 11%. These wines tend to have lower density values because alcohol and density are negatively correlated.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

As expected, high-quality wines tend to have more alcohol. They also tend to have higher citric.acid and lower volatile.acidity.

Were there any interesting or surprising interactions between features?

No, I did not see any surprising interactions in my analysis.


Final Plots and Summary

Plot One

Description One

This plot is from the univariate plots section. It’s an important graph because it shows how our samples are distributed across quality values. We can see that most of our samples have a quality value of 5 or 6; 13.5% have a quality value greater than 6, and 3.9% have a quality value less than 5.
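Percentages like these can be computed by binning quality into groups and taking proportions. The sketch below uses the ten `quality` values from the `str()` output, so its shares are illustrative and will not equal the 13.5% and 3.9% quoted for the full data; the group labels are introduced here.

```r
# Illustrative only: the ten quality values shown in the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Bin into: low (< 5), medium (5 or 6), high (> 6).
quality_group <- cut(quality,
                     breaks = c(-Inf, 4, 6, Inf),
                     labels = c("low", "medium", "high"))

# Share of samples in each group, as percentages.
group_share <- prop.table(table(quality_group))
print(round(100 * group_share, 1))
```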

Plot Two

Description Two

This plot is from the bivariate plots section. Wine is an alcoholic drink, so it’s expected that alcohol has an important effect on quality. We can see in this graph that, in general, wines with more alcohol tend to have higher quality.

Plot Three

Description Three

This plot is from the multivariate plots section.

Most wines with quality values greater than 6 have citric acid values greater than 0.25 and alcohol values greater than 11%.


Reflection

The red wine data set contains information on 1,599 wine samples with 11 variables on the chemical properties of the wine. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wines across many variables.

I showed a correlation table between all the variables. It was the most important graph for my analysis, because it helped me restrict the analysis to the variables that correlate most strongly with each other.

We can see that our dataset has a low number of samples with quality values of 3 or 4 and of 7 or 8, so I think having more samples, especially at these quality levels, would improve our analysis.

In future work we can do :